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C\) ' ABSTRACT 
^ ' We present an algorithm for cluster dynamics to efficiently simulate large systems on 

, MIMD parallel computers with large numbers of processing nodes. The method divides 

physical space into rectangular cells which are assigned to processing nodes and combines 
a serial procedure, by which clusters are labeled locally inside each cell, with a nearest 
neighbor relaxation process in which processing nodes exchange labels until a fixed point 
is reached. By controlling overhead and reducing inter-processor communication this 
£N| 1 method attains good performance and speed-up. The complexity and scaling properties 

of the algorithm are analyzed. The algorithm has been used to simulate large two- 
dimensional Ising systems (up to 27808 X 27808 sites) with Swendsen-Wang dynamics. 
Typical updating times on the order of 82 nanosecs/site and efficiencies larger than 90% 
have been obtained using 256 processing nodes on the CM-5 supercomputer. 
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1. Introduction 

Over the last five years, since the introduction of the Swendsen-Wang (SW) 
algorithm, 1 cluster dynamics have been extended to a variety of models in Sta- 
tistical Mechanics and Field Theory 2 . These new "accelerated dynamics" update 
percolation clusters instead of single spins producing faster decorrelation times and 
reduced critical slowing down 3 . In fact, the dynamic critical exponent z is reduced 
significantly or even eliminated in some cases. This acceleration does not come 
for free; the price to pay is an increased computational complexity compared with 
standard Metropolis, heat bath or other local algorithms. The cluster labeling 
procedure requires a fair amount of non-local (unstructured) data movements that 
makes cluster algorithms intrinsically hard to vectorize or paralellize. It is important 
to develop efficient cluster labeling methods because they are used every time step 
inside the core of the simulations. One has to worry not only about their complexity 
and scaling properties but also about absolute execution times. One of the mod- 
ern challenges of computational science is to find efficient algorithms to exploit the 
unprecedented computational power of today's massively parallel supercomputers. 
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Cluster labeling on a lattice is also relevant for the analysis of structures ob- 
tained in many computer simulations of statistical systems: Ising and percolation 
clusters, 4 nucleation droplets, polymers, crystals, fractal structures, and particle 
tracks in experimental high energy Physics. Cluster labeling is a special case of 
the more general problem of finding the connected components of a graph, 5 which 
has applications in computer vision, image processing and network analysis, among 
others. 

In this paper we propose a simple method for cluster labeling that can be used 
at the core of Monte Carlo simulations on MIMD 1 parallel computers and in par- 
ticular the CM-5 (Thinking Machines Corp.). The algorithm we will present is a 
general method for cluster finding in n— dimensional Euclidean lattices, but we will 
concentrate on the specific problem of cluster labeling for the Ising Model with 
Swendsen-Wang dynamics. Percolation bonds are defined between aligned spins 
with bond probability pbond = 1 — e _2/3 , and the clusters of connected spins, the 
Coniglio-Klein 6 ' 7,8 percolation clusters, are flipped with 50% probability. At the 
critical point, where most of the interesting behavior occurs, the clusters span the 
system and information has to be propagated across the entire computational do- 
main. 



Fig. 1. Typical critical clusters in the two-dimensional Ising model. The lines show the parti- 
tioning of the system into 16 cells (processing nodes). 
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Several methods to find connected components have been introduced in the Com- 
puter Science literature. There are general methods based on union-find, 9 ' 10 transi- 
tive closure, 11 vertex collapse 11,12,13,14 and vector models 15 . We realized that most 
of these methods are not well suited to practical lattice Monte Carlo simulations 
on MIMD machines where one has to worry about absolute execution times over 
average configurations instead of worse cases. Most of these methods are designed 
for idealized PRAM M models and require extensive global communications, shared 
memory or data format transformations. Some other methods are more appropri- 
ate for SIMD" 4 fine-grained machines 15 ' 16 ' 17 ' 25 ' 29 . The choice of data partitioning 
is very important 18 ' 9 . A number of MIMD cluster labeling algorithms, specifically 
designed for cluster Monte Carlo simulations have appeared in the literature, 19,21 
but have very limited speed-up and efficiency which limits their application to the 
simulation of large systems using large numbers of processing nodes. The algorithm 
with the best scaling properties appears to be the self-labeling method introduced 
by Baillie and Coddington 21 . Recently, Kertesz and Stauffer 20 have used the strip 
geometric parallelization method to simulate large systems (6400 2 ) on the Intel 
iPSC/860. These methods have greatly improved our knowledge of the behavior 
of large systems but still lack the required scaling properties to efficiently utilize 
hundreds or thousands of processing nodes. In this paper we propose a method ap- 
propriate for the simulation of large systems on large numbers of processing nodes 
which attains unprecedented speed-up and efficiency. The method is based on a 
rectangular domain decomposition strategy. Similar partitioning techniques have 
been used by Tucker, 18 Embrechts et al, 22 , Baillie and Coddington 21 and by D. Ra- 
paport who has simulated extremely large systems (640000 2 ) by sequentially loading 
one square subsystem at a time on an IBM RS/60 00 23 ' 20 . In our scheme the clusters 
are first labeled locally inside each processing node and then labels are propagated 
across processing nodes by a relaxation process. In section two we describe the 
algorithm and discuss some of its properties. In section three we will analyze its 
time complexity and scaling properties. Numerical results are discussed in section 
four. Finally, section five contains conclusions. 

2. Description of the Algorithm 

Physical space is divided into rectangular cells in such way that each cell is 
assigned to one processing node (see Figure 1). The algorithm labels the clusters 
in two stages: first it finds all the clusters inside each processing node using a serial 
local algorithm, and then it performs a global relaxation process in which processing 
nodes exchange clusters labels with nearest neighbors until a fixed point is reached. 
The operations of the algorithm arc shown in Figures 2 and 3. The procedure can 
be described as follows: 
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Fig. 2. (a) shows the initial state and connectivity of a sub-system on one processing node, (b) 
shows the local roots (in black) and the sites that point to them, (c) illustrates the data structures 
setup for the relaxation cycle. 



Procedure Cluster-Labeling: 

(i) Define connectivity for the sites: for Swendscn-Wang dynamics connectivity 
bonds are defined with probability Pbond = 1 — e~ 2 ^ between aligned spins. 

(ii) In each processing node independently apply the serial algorithm to find 
clusters (Procedure Local). At boundary sites the off- node bonds are ignored. 
At the end all sites are labeled with their corresponding "local roots" , 
which are then globalized 11 . 

(iii) Iterate relaxation cycles, exchanging local root labels with neighboring 
processing nodes until there is no change in any node (Procedure Relax). 
At the end all sites get their final global label from their local roots. 

(iv) Clusters are flipped with 50% probability and measurements of relevant 
quantities are accumulated (energy, magnetization etc.). 



"i.e. made unique over the whole system. This is done by adding an offset equal to the processing 
node number times the size of the local system (n). For the Swendsen-Wang dynamics we also 
multiply the result by two and add a random bit - in this way we can "piggy-back" flipping 
information in the parity of the cluster label. 
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Fig. 3. Processing nodes exchange local root labels with nearest neighbors. Minimum labels are 
found and local root labels are updated. The process is repeated until there are no more changes 
in local root labels. 

The serial local algorithm we used has similarities to both Hoshen-Kopclman 24 
and union-find 10 algorithms™. This particular serial algorithm proved efficient for 
our approach but other serial algorithms can be used. It can be briefly described 
as follows: 

Procedure Local: 

(i) Assign a unique label to each site (e.g. if sites are stored in a one 
dimensional array then the label is the array index of the site itself). 

(ii) For each site, follow the label paths for the site and each of its connected 
neighbors until their roots (sites pointing to themselves) are found. Then 
obtain the minimum label and set the site label, its root, and the roots of 
the connected neighbors to the minimum value. 

(iii) To finish the labeling process a final collapse of trees is done by a pass 
over all sites setting each site to point to its corresponding root. 

After the local labeling is completed (see Fig. 2b) a number of relaxation cycles 
are executed to label the clusters globally. The relaxation procedure consists of the 



"'We also used some ideas from an earlier serial code developed by R. Giles, R. Brower and P. 
Tamayo at Boston University. 
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following: 

Procedure Relax: 

(i) Execute a preparation step to set data structures: for each node boundary 
define a list of pointers to local roots (see Fig. 2c), one pointer per each bond 
crossing node boundaries. 

(ii) Execute a sequence of relaxation cycles until all nodes detect no change 
in the labels (see Fig. 3): 

(a) Each processing node interchanges boundary labels with the neighboring 
node in each direction. The labels are sent in a single block of data using 
synchronous message-passing calls. 

(b) The local root labels are compared with the ones obtained from the 
neighbors and then set to the minimum values. 

We find that even for large numbers of processing nodes the execution time 
is dominated by the local part. Originally we intended to implement a multi- 
grid scheme for the relaxation part, as was used in ref. 25. This extension is 
straightforward and makes the algorithm more scalable but we have found that 
in practice it is not necessary given the sizes of coarse-grained MIMD machines 
available today. Even a large parallel computer with 1024 processing nodes forms 
relatively small lattices in 2 and 3 dimensions which would require only a few 
multigrid levels. Nearest-neighbor relaxation is very efficient for this problem at 
the scale of the processing nodes grid. 

3. Analysis: Complexity and Scaling 

Considering the fact that the cluster labeling algorithm operates at the core of 
equilibrium Monte Carlo simulations the analysis will focus on average case instead 
of worst case performance. The probability of obtaining a given configuration of 
clusters is determined by the Boltzmann weight of that particular configuration. 
The probability of observing the worst case is negligible. The execution times we 
will consider correspond to averages over equilibrium configurations generated in the 
Monte Carlo process. Furthermore, we will analyze the properties of the algorithm 
at the critical point for the Ising model. 

In this section we will make a simplified model for the time complexity of the 
algorithm based on simple scaling arguments. The predictions of the model will be 
compared with experimental results in the next section. Similar analysis for the 
geometric parallelization method have been presented by Burkitt and Heermann 19 
and Jakobs and Gerling 26 . 

In the following discussion, N = L x L is the total size of the system, n = I x I 
is the size for the subsystem in each processing node, and p is the total number 
of processing nodes. Clearly the total number of sites is N — np. In addition, a 
and b are constants. We start by considering the total time to perform the cluster 
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labeling as consisting of two contributions: a "local time", the time spent by the 
serial algorithm inside each processing node, and a "relax time" which is the time 
spent in the global relaxation procedure until completion. The scaling of the local 
part is basically 0{n log*n) where log*n is a very slow growing function"". In 
practice, as it is discussed in ref. 10, the value of log*n can be safely considered a 
constant and the serial time is basically 0(n), 



Tiocal = al 2 — an. (1) 

To compute the relaxation time we need to compute two contributions: t re i ax , 
the time to do one relaxation cycle and n re iax, the number of relaxation cycles 
needed to complete the process. t re iax is proportional to the size of the bound- 
ary between processing nodes, therefore, assuming the communication times grow 
linearly with message size, we have, 



Uelax ~ I ~ n 1 ' 2 . (2) 

If we consider relaxation as a nearest-neighbor propagation of labels, then n re iax 
should be proportional to the maximum depth of the clusters embedded in the lat- 
tice. The maximum depth of the clusters is equivalent to the maximum shortest 
path joining two sites over the set of clusters: X m in', this length, also known as the 
chemical distance 27 , characterizes the way information is transmitted inside a clus- 
ter by step-by-step processes such as nearest neighbor label propagation. Typically 
for percolation clusters, \ m in scales with cluster linear size r with a characteristic 
exponent d min , 



Xmin ~ (3) 

which implies that the number of relaxation cycles to label a given cluster will scale 
with d m i n . Since the clusters can be as large as the entire system (r = L), then 
n re iax will be proportional to L dmin . However, in our case we have relaxation only 
at the scale of the processing nodes, not at the scale of single sites, and then the 
correct length to use in the scaling expression for n re i ax is the renormalized length 



nreiax ~ (L/i)*"'" ~ P^' 2 - (4) 

The connectivity at lower scales is "integrated out" by the local procedure but 
we are assuming that the chemical distance of the resulting renormalized lattice 
still scales with the same exponent d m in- The value of d m i n for 2d Coniglio-Klein 
clusters is reported to be 28 1.08 ± 0.01; consequently we expect n re iax to grow 

vtl \og* n is equal to the number of times the log function has to be applied recursively to the 
argument until it converges to one. 
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slightly faster than the square root of the number of processing nodes 1 '" 1 . Now we 
can express the total relaxation time, T re i ax , as 

Trelax — ^relax treiax = bp mtn l fi I , (5) 

where we can see that d m i n plays the role of a "computational" critical slowing 
down exponent for the algorithm. This is one of the cases in which a physical pa- 
rameter of the simulated system, such as the chemical length exponent, determines 
its computational properties. 

The total time for the parallel algorithm is then the sum of local plus relax 
times, 



'-parallel 



Ti oca i+T relax = an + bp d ^/ 2 n^ 2 . (6) 



If n is large compared with p, and the communication to computation ratio a/b 
is small, the scaling will be dominated by the local part. This can be seen more 
explicitly by calculating the speed-up function S, which is the ratio between the 
serial and parallel times, 

n Tserial 



^parallel 

We will use as T ser { a i the time of the local algorithm running on only one pro- 
cessing node. We can express S as a function of n and p in the following way, 

S ( n 'P) = 1 u 12 1/2 ' ( 8 ) 

an + bp dm '"^n L '^ 

As expected the speed-up improves with large n and gets worse as p increases. 
Usually, the speed-up is computed as a function of p for fixed system size N, 

i \ aNp 

™ = aN + bp ( d ^ + D/2 N l/2- 



The corresponding efficiency Ejy(p) = Sn{p)/p is 

The efficiency decreases as the number of processing nodes increases because 
inter-processor communication times will eventually dominate. In practice it is 
important to make b/a as small as possible. As we will see in the next section, this 
can be done effectively on the CM-5 where b/a ~ 1. It is interesting to notice that 
the efficiency is a universal function of p( d ^^ +1 ) N^ 1 . In terms of efficiency the only 



""The use of a multigrid or hierarchical scheme can make n re i ax scale as log 2 p, see for example 
refs. 25 and 29. 
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Fig. 4. Local, relax and total times for 2d SW dynamics as a function of N 1 / 2 for fixed p = 256. 



Fig. 5. Local, relax and total times for 2d SW dynamics as a function of p for fixed n. 
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important parameter in the simulation is the value of p( d ^^+ 1 ) N 1 : the larger N 
is in relation to p the better the algorithm will perform. 

4. Numerical Results 

We have implemented the algorithm using standard C language plus message- 
passing calls (for inter-processor communication) using the CM-5 CMMD library 111 " . 
Detailed information about the CM-5 and its network architecture can be found in 
ref. 30. 



Fig. 6. The number of relaxation cycles as a function of the number of processing nodes for fixed 
n = 128 2 . The dot-dashed line gives the asymptotic slope predicted by d m i n /2. 

We have performed simulations of the 2d Ising model with Swendsen-Wang 
dynamics at the critical point and measured execution times for different values of 
n and p, and total system sizes N from 256 2 to 27808 2 , using up to 256 processing 
nodes on CM-5 machines without vector units (SPARC node processors). Tables 1 
and 2, and Figures 4 and 5 show times (local, relax, and total) for different system 
sizes and numbers of processing nodes. Measurements times (energy, magnetization, 
etc.) were not included in the timings. As we can see in Figures 4 and 5, local times 
dominate for large systems and we obtain good performance. Typical updating 
times are 314 nanosecs/site for a 64 node CM-5 and 82 nanosecs/site on a 256 node 
CM-5. 



The program is about 600 lines of code and it will be available from the authors. 
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Fig. 7. Total time as a function of inverse temperature (3. 



Fig. 8. Speed-up data for up to 256 processing nodes for different system sizes. 
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Table 1. Timings for 2d SW dynamics. 



p = 64 



N 


local [sees] 


relax [sees] 


total [sees] 


nanosecs/sitc 


256 2 


0.019 


0.010 


0.029 


442 


512 2 


0.076 


0.014 


0.090 


343 


1024 2 


0.307 


0.021 


0.328 


313 


2048 2 


1.264 


0.027 


1.291 


308 


4096 2 


5.188 


0.089 


5.277 


314 


8192 2 


20.94 


0.143 


21.09 


314 



Table 2. Timings for 2d SW dynamics. 



p = 256 



N 


local [sees] 


relax [sees] 


total [sees] 


nanosecs/sitc 


512 2 


0.020 


0.025 


0.045 


172 


1024 2 


0.079 


0.027 


0.106 


101 


2048 2 


0.318 


0.035 


0.353 


84 


4096 2 


1.277 


0.078 


1.355 


81 


8192 2 


5.232 


0.159 


5.391 


80 


16384 2 


21.27 


0.341 


21.61 


81 


27808 2 


62.87 


0.566 


63.44 


82 



Simulating a 27808 2 system requires about 15 Mbytes of local memory on a 256 
node CM-5. This is about 5 bytes per site (4 bytes, one integer, for the label and 
one byte for spin and bond connectivity information) . Using 32 Mbytes of memory 
systems as large as 40600 2 can be simulated on 256 nodes. The local procedure is 
not particularly amenable to vectorization but the use of the parallelism provided 
by the 4 vector units on each CM-5 node, which support integer operations, will 
increase the speed of the local procedure significantly. 

The scaling behavior of the measured times agrees well with the simple scaling 
model of the previous section (Equations 1-5): local times scale linearly with n, 
and relaxation times with n 1 / 2 (see Fig. 4). The number of relaxation cycles as a 
function of p for fixed n approaches the asymptotic slope d m i n /2 as can be seen in 
Fig. 6. From these data we obtain b/a ~ 1. Figure 7 shows the total time as a 
function of temperature where the peak corresponds to the critical point. 

The speed-up Sn(p) as a function of the number of processing nodes for different 
system sizes N is shown in Figure 8. Notice that for large system sizes the speed-up 
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Fig. 9. Efficiency data for up to 256 processing nodes for different system sizes. 

keeps increasing without saturation even for our maximum p equal to 256 process- 
ing nodes. The algorithm has much better speed-up and efficiency than previous 
methods 1 . The 27808 2 lattice is simulated with 92 % efficiency. The functional form 
of these curves is given by Eq. 9. The efficiency En(j>) = Sn(p)/p as a function of p 
is shown in Fig. 9. Simulations of large systems (> 2048 2 ) attain efficiencies higher 
than 90% even for up to 256 processing nodes. When these points are plotted as a 
function of [p( dmin+1 )./V _1 ] 1 / 2 (see Fig. 10) they show the scaling described by Eq. 
10: data for different system sizes collapses reasonably well to a universal curve. In 
this plot, as well as in the others, there are fluctuations in the timings that cause 
small deviations from the expected behavior. This is due to cache memory effects 
and non-linear behavior in the message-passing communications. We have not tried 
to take these effects into account in the scaling model because they are hard to 
estimate and will make the analysis unnecessarily complicated. 

5. Conclusions 

We have found that rectangular domain decomposition and a combination of a 
local method with global relaxation produces much better scaling properties than 
other methods. A careful choice of data structures, partitioning and communication 
strategies is fundamental to produce practical and efficient algorithms for Monte 

l For example compare our Fig. 8 with Fig. 5 of ref. 19 (first paper) or Figures 4, 5 and 6 of ref. 
21. 
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Fig. 10. Universal scaling of efficiency data plotted as a function of 

lp(d min +l) N -^ 1/2 
The solid line is the predicted behavior given by Equation 10 (with a/6 = 1). 

Carlo simulations. A simple scaling model for the complexity of the algorithm gives 
reasonable predictions for the observed times, speed-up and efficiencies. 

Our algorithm running on a 256 node CM-5 is about 79 times faster than a 
Swendsen-Wang program 32 on a Cray-XMP (6.5 ^sees/site), which clearly indi- 
cates that MIMD parallel computers, and the CM-5 in particular, can be used very 
effectively for this kind of computational problems. 

It is interesting to notice that many of the most efficient cluster algorithms, 
SIMD and MIMD, employ relaxation or relaxation plus multigrid methods: Browcr 
et aP 5 (1.5 ^sec/site, 64K CM-2), Baillie and Coddington 21 ' 31 (1 /xsec/site, Symult 
192 nodes and nCUBE-2 64 nodes), Apostolakis et aP< 31 (6.5 ^sees/site, 16K CM- 
2) and this work (82 nanosecs/site, CM-5 256 nodes). Other methods are also 
being improved. Recently, Apostolakis, Coddington and Marinari 31 obtained speeds 
of 1.36 /usecs/site on a 32K CM-2 using a global get/send method. Kertesz and 
Stauffer 20 attained speeds of about 0.5 /xsecs/site with their Intel iPSC/860 (8 
nodes) implementation of the strip geometric parallclization method. 

We hope our cluster labeling method will provide a useful tool for the study of 
large systems and the calculation of critical exponents, correlation functions etc, 
with unprecedented accuracy. It can be applied to simulations of Ising and Potts 
models, 1 ' 2 embedded dynamics simulations of Landau-Ginzburg and Heisenberg 
models, 2 virtual bond percolation dynamics 33 and the study of static and dynamic 
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properties of percolation clusters 4 ' 6 . 
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